Introduction

This coursework focuses on housing prices, with the main objective being to predict the price of a property based on various inputs. The inputs include features such as the area, the number and types of rooms, and additional factors like the availability of a main road, hot water heating, and more.

The dependent variable is the price, as it is the primary concern for most people searching for a house. The goal of this work is to predict the price based on diverse inputs, which consist of mixed data types, such as:

  • Numerical values
  • Text-based responses like “yes” or “no”
  • Categories for furnishing status: “furnished”, “semi-furnished”, or “unfurnished”

This project addresses a regression problem because the objective is to predict a numeric value—in this case, the price of the property.

Collection / Preparation

Now we are going to import our dataset into this project.

dt_houses <- fread(file = "./Datasets/Regression_set.csv")


I would like to check whether the dataset contains any missing values. A simple way is to go through all rows and columns and look for NA. R's built-in function complete.cases(data_table) does this: for each row it returns TRUE if the row contains no NA and FALSE otherwise, so negating it selects the rows with missing values.

nas <- dt_houses[!complete.cases(dt_houses)]
nas

That looks great, now we can explore our dataset :)

Exploration

Explore your data by means of select summary statistics and visualizations and present interesting findings to your reader.

Before exploring the data, I want to load all the libraries that we will probably use:

library(data.table)
library(ggcorrplot)
library(ggExtra)
library(ggplot2)
library(ggridges)
library(ggsci)
library(ggthemes)
library(RColorBrewer)
library(svglite)
library(viridis)
library(scales)
library(rpart)
library(rpart.plot)

R has some helpful functions for getting a first look at the data. We will start with the structure, then get some summary statistics, and finally take the head() of the data.

str(dt_houses)
Classes ‘data.table’ and 'data.frame':  545 obs. of  13 variables:
 $ price           : int  13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
 $ area            : int  7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
 $ bedrooms        : int  4 4 3 4 4 3 4 5 4 3 ...
 $ bathrooms       : int  2 4 2 2 1 3 3 3 1 2 ...
 $ stories         : int  3 4 2 2 2 1 4 2 2 4 ...
 $ mainroad        : chr  "yes" "yes" "yes" "yes" ...
 $ guestroom       : chr  "no" "no" "no" "no" ...
 $ basement        : chr  "no" "no" "yes" "yes" ...
 $ hotwaterheating : chr  "no" "no" "no" "no" ...
 $ airconditioning : chr  "yes" "yes" "no" "yes" ...
 $ parking         : int  2 3 2 3 2 2 2 0 2 1 ...
 $ prefarea        : chr  "yes" "no" "yes" "yes" ...
 $ furnishingstatus: chr  "furnished" "furnished" "semi-furnished" "furnished" ...
 - attr(*, ".internal.selfref")=<externalptr> 


Summary statistics:

summary(dt_houses[, .(price, area, bedrooms, bathrooms, stories, parking)])
     price               area          bedrooms       bathrooms        stories         parking      
 Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :0.0000  
 1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000  
 Median : 4340000   Median : 4600   Median :3.000   Median :1.000   Median :2.000   Median :0.0000  
 Mean   : 4766729   Mean   : 5151   Mean   :2.965   Mean   :1.286   Mean   :1.806   Mean   :0.6936  
 3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.000   Max.   :4.000   Max.   :3.0000  


And this is a sample of our dataset:

head(dt_houses)

I would like to start with the densities of the main variables, which, based on my domain knowledge, are important for the price of a property.

We will start with price density:

ggplot(data = dt_houses, aes(x = price)) + 
  geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) + 
  labs(
    x = 'Price',
    y = 'Density'
  ) +
  geom_vline(xintercept = mean(dt_houses$price), linetype="dashed") + 
  scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M")) + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))

It would also be great to have a look at the area density:

ggplot(data = dt_houses, aes(x = area)) + 
  geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) + 
  labs(
    x = 'Area',
    y = 'Density'
  ) +
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))


It is interesting to see how area affects the price of a house. We will plot it as points, with price on the y-axis and area on the x-axis.

ggplot() + 
  geom_point(data = dt_houses, aes(x = area, y = price, color = parking)) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))

This looks nice, and it is also logical: more space, higher price.

Now for the simplest idea: how does the number of bedrooms correlate with the price?

ggplot(data = dt_houses, aes(x = factor(bedrooms), y = price)) +
  geom_boxplot() + 
  theme_minimal() 

We can see that, on average, more bedrooms mean a higher price, but I do not think there is a really strong relationship between these two variables.

It would also be great to take a look at a histogram of bedrooms:

ggplot(data = dt_houses, aes(x = bedrooms)) + 
  geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) + 
  geom_vline(xintercept = mean(dt_houses$bedrooms), linetype="dashed") + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))

I also want to show the mean number of bedrooms:

mean(dt_houses$bedrooms)
[1] 2.965138

Here we can see that most of the properties have 2, 3 or 4 bedrooms.

Let’s also have a look at the histogram and mean value of stories:

ggplot(data = dt_houses, aes(x = stories)) + 
  geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) + 
  geom_vline(xintercept = mean(dt_houses$stories), linetype="dashed") + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))

mean(dt_houses$stories)
[1] 1.805505

It is interesting to see how many properties are furnished or not:

ggplot(data = dt_houses, aes(x = factor(furnishingstatus), fill = factor(furnishingstatus))) + 
  geom_bar(color="#ced4da", alpha=0.25) + 
  scale_fill_viridis_d(option = "D") + 
  labs(title = "Bar Chart with Different Colors", 
       x = "Furnishing Status", 
       y = "Count") + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))

Now it would be great to look at the price and area distribution for differently furnished properties:

ggplot(data = dt_houses, aes(y = price, x = area)) + 
  geom_point(data = dt_houses, aes(y = price, x = area, color = bedrooms)) +
  geom_hline(yintercept = mean(dt_houses$price), linetype='dashed') + 
  facet_grid(.~furnishingstatus) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  scale_color_distiller(type = "seq", palette = "Greens") +
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))

We can also take a look at a pie chart:


dt_mainroad_counts <- as.data.frame(table(dt_houses$mainroad)) #table() - creates frequency table
colnames(dt_mainroad_counts) <- c("mainroad_status", "count")
dt_mainroad_counts$percentage <- round(dt_mainroad_counts$count / sum(dt_mainroad_counts$count) * 100, 1)

ggplot(data = dt_mainroad_counts, aes(x = "", y = count, fill = mainroad_status)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(percentage, "%")), 
            position = position_stack(vjust = 0.5), color = "white", size = 4) +  
  theme_void() +  
  scale_fill_manual(values = c("#F1B147", "#47B1F1")) + 
  labs(
    title = "Distribution of Mainroad Status",
    fill = "Mainroad Status"
  )

I think that is enough exploration, and we can start with our first model.

Models 1 & 2

Run two regression or classification models with a minimum of 5 (identical!) inputs and evaluate them in detail: Compare their performance on your data (appropriate performance metric, performance on specific regions of the input/output space) and identify potential problems/shortcomings.

If you tune a model (e.g. threshold of a logistic regression) for some metric, use only the final tuned version in the comparison with the other model.

First, I would like to start pretty simple, with a linear model.

I am going to use these variables in my model: area, bedrooms, bathrooms, hotwaterheating, airconditioning, stories, mainroad, parking and furnishingstatus.

Linear model

I will use the lm function in R to estimate the beta coefficients and create my model:

price_lm <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms, data = dt_houses)

summary(price_lm)

Call:
lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + 
    stories + mainroad + parking + furnishingstatus + bathrooms, 
    data = dt_houses)

Residuals:
     Min       1Q   Median       3Q      Max 
-2632747  -712077   -26462   522681  5300066 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      10359.2   278618.7   0.037 0.970355    
area                               269.2       25.3  10.640  < 2e-16 ***
bedrooms                        178658.3    76119.4   2.347 0.019285 *  
hotwaterheatingyes              788761.1   236265.3   3.338 0.000901 ***
airconditioningyes              949352.9   114199.3   8.313 7.77e-16 ***
stories                         373570.2    65275.8   5.723 1.75e-08 ***
mainroadyes                     586360.5   149172.3   3.931 9.58e-05 ***
parking                         261131.8    61971.0   4.214 2.95e-05 ***
furnishingstatussemi-furnished  -91500.4   123437.7  -0.741 0.458857    
furnishingstatusunfurnished    -509693.4   133084.7  -3.830 0.000143 ***
bathrooms                      1049426.9   108784.1   9.647  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1132000 on 534 degrees of freedom
Multiple R-squared:  0.6402,    Adjusted R-squared:  0.6335 
F-statistic: 95.03 on 10 and 534 DF,  p-value: < 2.2e-16

We got an R-squared of 0.64, which is not that bad for a first attempt. I will try to do better later, but first, another model.

price_lm_mse <- mean(price_lm$residuals^2)

price_lm_mse
[1] 1.256325e+12

Tree Model

I think this model could perform better, because some variables may affect the price not only linearly but in other ways; in that case a tree model can show better performance.

prices_tree <- rpart(data = dt_houses, formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms, method = 'anova')

prp(prices_tree, digits = -3)

printcp(prices_tree)

Regression tree:
rpart(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + 
    stories + mainroad + parking + furnishingstatus + bathrooms, 
    data = dt_houses, method = "anova")

Variables actually used in tree construction:
[1] airconditioning  area             bathrooms        furnishingstatus parking         

Root node error: 1.9032e+15/545 = 3.4921e+12

n= 545 

         CP nsplit rel error  xerror     xstd
1  0.304946      0   1.00000 1.00109 0.085049
2  0.094553      1   0.69505 0.71335 0.063159
3  0.053743      2   0.60050 0.61804 0.054462
4  0.026381      3   0.54676 0.58900 0.051441
5  0.024922      4   0.52038 0.58900 0.051443
6  0.022993      5   0.49546 0.58033 0.050822
7  0.021374      6   0.47246 0.55537 0.049904
8  0.015261      7   0.45109 0.54872 0.048615
9  0.012386      8   0.43583 0.52496 0.046999
10 0.010000      9   0.42344 0.52235 0.046004
prices_tree
n= 545 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 545 1.903208e+15 4766729  
   2) area< 5954 361 6.066751e+14 4029993  
     4) bathrooms< 1.5 293 3.297298e+14 3773561  
       8) area< 4016 174 1.437122e+14 3431227  
        16) furnishingstatus=unfurnished 78 4.036605e+13 2977962 *
        17) furnishingstatus=furnished,semi-furnished 96 7.430067e+13 3799505 *
       9) area>=4016 119 1.358098e+14 4274118 *
     5) bathrooms>=1.5 68 1.746610e+14 5134912  
      10) airconditioning=no 44 7.024826e+13 4563682 *
      11) airconditioning=yes 24 6.373358e+13 6182167 *
   3) area>=5954 184 7.161564e+14 6212174  
     6) bathrooms< 1.5 108 2.869179e+14 5382579  
      12) airconditioning=no 65 1.170629e+14 4843569 *
      13) airconditioning=yes 43 1.224240e+14 6197360 *
     7) bathrooms>=1.5 76 2.492851e+14 7391072  
      14) parking< 1.5 51 7.184700e+13 6859794 *
      15) parking>=1.5 25 1.336772e+14 8474878  
        30) airconditioning=no 10 5.146311e+13 7285600 *
        31) airconditioning=yes 15 5.864106e+13 9267729 *
plotcp(prices_tree)

prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp)
# Plot the pruned tree, not the original one
prp(model_tree, digits = -3)

# Predict with the pruned tree so that the tuned model is the one evaluated
prices_tree_pred <- predict(model_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus")])
prices_tree_mse <- mean((dt_houses$price - prices_tree_pred)^2)

prices_tree_mse
[1] 1.478709e+12

Comparing two models

Feature Engineering

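Now I would like to improve my linear model. I think that furnishing status should be treated as a factor variable, so I am going to convert it to a factor explicitly:

dt_houses$furnishingstatus_factor <- factor(dt_houses$furnishingstatus)

Now rerun the model with the new feature. Because lm() already converts character columns to factors internally, the estimates match the earlier fit; only the coefficient names change.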

price_lm <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus_factor + bathrooms, data = dt_houses)

summary(price_lm)

Call:
lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + 
    stories + mainroad + parking + furnishingstatus_factor + 
    bathrooms, data = dt_houses)

Residuals:
     Min       1Q   Median       3Q      Max 
-2632747  -712077   -26462   522681  5300066 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                             10359.2   278618.7   0.037 0.970355    
area                                      269.2       25.3  10.640  < 2e-16 ***
bedrooms                               178658.3    76119.4   2.347 0.019285 *  
hotwaterheatingyes                     788761.1   236265.3   3.338 0.000901 ***
airconditioningyes                     949352.9   114199.3   8.313 7.77e-16 ***
stories                                373570.2    65275.8   5.723 1.75e-08 ***
mainroadyes                            586360.5   149172.3   3.931 9.58e-05 ***
parking                                261131.8    61971.0   4.214 2.95e-05 ***
furnishingstatus_factorsemi-furnished  -91500.4   123437.7  -0.741 0.458857    
furnishingstatus_factorunfurnished    -509693.4   133084.7  -3.830 0.000143 ***
bathrooms                             1049426.9   108784.1   9.647  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1132000 on 534 degrees of freedom
Multiple R-squared:  0.6402,    Adjusted R-squared:  0.6335 
F-statistic: 95.03 on 10 and 534 DF,  p-value: < 2.2e-16

Engineer a minimum of two new features based on your data exploration or on theoretical considerations. Add these features to your models and reevaluate their performance on the same performance metrics as before.


---
title: "Coursework - Data Science I"
author: "Omar Zhadykov, 220220503"
output:
  html_notebook:
    fig_width: 10
    theme: spacelab
    toc: yes
    toc_depth: 3
    toc_float: yes
  word_document:
    toc: yes
    toc_depth: '3'
  pdf_document: default
  html_document:
    fig_width: 10
    theme: spacelab
    toc: yes
    toc_depth: 3
    toc_float: yes
---

<script>
$(document).ready(function() {
  $items = $('div#TOC li');
  $items.each(function(idx) {
    num_ul = $(this).parentsUntil('#TOC').length;
    $(this).css({'text-indent': num_ul * 10, 'padding-left': 0});
  });

});
</script>

```{r setup, warning=FALSE, message=FALSE, echo=FALSE}
library(svglite)
library(knitr)
suppressPackageStartupMessages(library(data.table))
library(ggplot2)
knitr::opts_chunk$set(dev = "svglite")

# Put your dataset in the same folder as your R file. This code will set your working directory for this notebook to the folder where the R file is stored. This way I can rerun your code without modifications.

library(rstudioapi)
setwd(dirname(getActiveDocumentContext()$path))
```

# Introduction

This coursework focuses on housing prices, with the main objective being to predict the price of a property based on various inputs. The inputs include features such as the area, the number and types of rooms, and additional factors like the availability of a main road, hot water heating, and more.

The dependent variable is the price, as it is the primary concern for most people searching for a house. The goal of this work is to predict the price based on diverse inputs, which consist of mixed data types, such as:

  - Numerical values
  - Text-based responses like "yes" or "no"
  - Categories for furnishing status: "furnished", "semi-furnished", or "unfurnished"

This project addresses a regression problem because the objective is to predict a numeric value—in this case, the price of the property.
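
As a small sketch of how R handles these mixed data types in a regression formula, `model.matrix()` shows the design matrix that `lm()` would build. The toy values below are made up for illustration; only the column names mirror the housing dataset.

```{r}
# Toy data with hypothetical values; column names mirror the housing dataset.
toy_houses <- data.frame(
  area = c(3000, 4500, 6000, 7500),
  mainroad = c("yes", "no", "yes", "yes"),
  furnishingstatus = c("furnished", "semi-furnished", "unfurnished", "furnished")
)

# model.matrix() shows the design matrix that lm() would build from a formula.
# Character columns become dummy variables, with the first level
# (alphabetically) used as the baseline.
toy_design <- model.matrix(~ area + mainroad + furnishingstatus, data = toy_houses)
colnames(toy_design)
```

This is why a yes/no input appears later as a single coefficient such as `mainroadyes`, and a three-level category as two coefficients.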

# Collection / Preparation 

Now we are going to import our dataset into this project.

```{r}
dt_houses <- fread(file = "./Datasets/Regression_set.csv")
```

<br>
I would like to check whether the dataset contains any missing values. A simple way is to go through all rows and columns and look for NA. R's built-in function *complete.cases(data_table)* does this: for each row it returns TRUE if the row contains no NA and FALSE otherwise, so negating it selects the rows with missing values.

```{r}
nas <- dt_houses[!complete.cases(dt_houses)]
nas
```
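
To make the behaviour of the check concrete, here is a minimal sketch on a toy table with two deliberately missing entries. The table `toy_dt` and its values are hypothetical; the per-column count with `colSums(is.na(...))` is an extra check I am assuming is useful, not part of the analysis above.

```{r}
library(data.table)

# Toy table with hypothetical values and two deliberately missing entries.
toy_dt <- data.table(
  price = c(100, NA, 300),
  area = c(50, 60, NA),
  mainroad = c("yes", "no", "yes")
)

# complete.cases() is TRUE for rows without any NA.
toy_complete <- complete.cases(toy_dt)
toy_complete

# Rows with at least one missing value.
toy_incomplete <- toy_dt[!toy_complete]
toy_incomplete

# Number of missing values per column.
toy_na_counts <- colSums(is.na(toy_dt))
toy_na_counts
```

An empty result from the row filter means that every row is complete.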

That looks great, now we can explore our dataset :)

# Exploration

Explore your data by means of select summary statistics and visualizations and present interesting findings to your reader. 

Before exploring the data, I want to load all the libraries that we will probably use:

```{r}
library(data.table)
library(ggcorrplot)
library(ggExtra)
library(ggplot2)
library(ggridges)
library(ggsci)
library(ggthemes)
library(RColorBrewer)
library(svglite)
library(viridis)
library(scales)
library(rpart)
library(rpart.plot)
```

R has some helpful functions for getting a first look at the data. We will start with the structure, then get some summary statistics, and finally take the *head()* of the data.

```{r}
str(dt_houses)
```
<br>
Summary statistics:
```{r}
summary(dt_houses[, .(price, area, bedrooms, bathrooms, stories, parking)])
```

<br>
And this is a sample of our dataset:

```{r}
head(dt_houses)
```

I would like to start with the densities of the main variables, which, based on my domain knowledge, are important for the price of a property.

We will start with price density: 

```{r}
ggplot(data = dt_houses, aes(x = price)) + 
  geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) + 
  labs(
    x = 'Price',
    y = 'Density'
  ) +
  geom_vline(xintercept = mean(dt_houses$price), linetype="dashed") + 
  scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M")) + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))
```

It would also be great to have a look at the area density:

```{r}
ggplot(data = dt_houses, aes(x = area)) + 
  geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) + 
  labs(
    x = 'Area',
    y = 'Density'
  ) +
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))
```


<br>
It is interesting to see how area affects the price of a house. We will plot it as points, with price on the y-axis and area on the x-axis.

```{r}
ggplot() + 
  geom_point(data = dt_houses, aes(x = area, y = price, color = parking)) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))
```

This looks nice, and it is also logical: more space, higher price.
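
The visual impression can also be put into a number. The following is only a sketch with made-up values (`toy_area` and `toy_price` are hypothetical), showing the two correlation calls one could apply to the real `area` and `price` columns.

```{r}
# Hypothetical toy values, only to illustrate the calls.
toy_area  <- c(3000, 4200, 5100, 6400, 8000)
toy_price <- c(2.9e6, 4.1e6, 4.0e6, 6.2e6, 7.5e6)

# Pearson correlation measures the strength of the linear relationship.
toy_pearson <- cor(toy_area, toy_price)

# Spearman correlation is rank-based and only assumes a monotonic relationship.
toy_spearman <- cor(toy_area, toy_price, method = "spearman")

c(pearson = toy_pearson, spearman = toy_spearman)
```

A value close to 1 would support the "more space, higher price" reading; a clearly lower value would suggest that other inputs matter as well.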

Now for the simplest idea: how does the number of bedrooms correlate with the price?

```{r}
ggplot(data = dt_houses, aes(x = factor(bedrooms), y = price)) +
  geom_boxplot() + 
  theme_minimal() 
```

We can see that, on average, more bedrooms mean a higher price, but I do not think there is a really strong relationship between these two variables.
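
The boxplot can be backed up with a grouped summary. This is a sketch on hypothetical toy values (the table `toy_bedrooms` is made up); the same `data.table` expression would work on the real data.

```{r}
library(data.table)

# Hypothetical toy values; column names mirror the housing dataset.
toy_bedrooms <- data.table(
  bedrooms = c(2, 2, 3, 3, 3, 4),
  price = c(3.0e6, 3.4e6, 4.0e6, 4.4e6, 5.2e6, 6.0e6)
)

# Median price and group size per bedroom count, ordered by bedrooms.
toy_by_bedrooms <- toy_bedrooms[, .(median_price = median(price), n = .N), keyby = bedrooms]
toy_by_bedrooms
```

Showing the group size next to the median helps to judge how reliable each group is, since groups with very few properties can look extreme by chance.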

It would also be great to take a look at a histogram of bedrooms:

```{r}
ggplot(data = dt_houses, aes(x = bedrooms)) + 
  geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) + 
  geom_vline(xintercept = mean(dt_houses$bedrooms), linetype="dashed") + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))
```
I also want to show the mean number of bedrooms:
```{r}
mean(dt_houses$bedrooms)
```


Here we can see that most of the properties have 2, 3 or 4 bedrooms.

Let's also have a look at the histogram and mean value of stories:

```{r}
ggplot(data = dt_houses, aes(x = stories)) + 
  geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) + 
  geom_vline(xintercept = mean(dt_houses$stories), linetype="dashed") + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))
```

```{r}
mean(dt_houses$stories)
```

It is interesting to see how many properties are furnished or not:

```{r}
ggplot(data = dt_houses, aes(x = factor(furnishingstatus), fill = factor(furnishingstatus))) + 
  geom_bar(color="#ced4da", alpha=0.25) + 
  scale_fill_viridis_d(option = "D") + 
  labs(title = "Bar Chart with Different Colors", 
       x = "Furnishing Status", 
       y = "Count") + 
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))
```

Now it would be great to look at the price and area distribution for differently furnished properties:


```{r}
ggplot(data = dt_houses, aes(y = price, x = area)) + 
  geom_point(data = dt_houses, aes(y = price, x = area, color = bedrooms)) +
  geom_hline(yintercept = mean(dt_houses$price), linetype='dashed') + 
  facet_grid(.~furnishingstatus) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  scale_color_distiller(type = "seq", palette = "Greens") +
  theme_minimal() + 
  theme(axis.line = element_line(color = "#000000"))
```

We can also take a look at a pie chart:

```{r}

dt_mainroad_counts <- as.data.frame(table(dt_houses$mainroad)) #table() - creates frequency table
colnames(dt_mainroad_counts) <- c("mainroad_status", "count")
dt_mainroad_counts$percentage <- round(dt_mainroad_counts$count / sum(dt_mainroad_counts$count) * 100, 1)

ggplot(data = dt_mainroad_counts, aes(x = "", y = count, fill = mainroad_status)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(percentage, "%")), 
            position = position_stack(vjust = 0.5), color = "white", size = 4) +  
  theme_void() +  
  scale_fill_manual(values = c("#F1B147", "#47B1F1")) + 
  labs(
    title = "Distribution of Mainroad Status",
    fill = "Mainroad Status"
  )

```


I think that is enough exploration, and we can start with our first model.

# Models 1 & 2

Run two regression or classification models with a minimum of 5 (identical!) inputs and evaluate them in detail: Compare their performance on your data (appropriate performance metric, performance on specific regions of the input/output space) and identify potential problems/shortcomings. 

If you tune a model (e.g. threshold of a logistic regression) for some metric, use only the final tuned version in the comparison with the other model. 

First, I would like to start pretty simple, with a linear model.

I am going to use these variables in my model: area, bedrooms, bathrooms, hotwaterheating, airconditioning, stories, mainroad, parking and furnishingstatus.

## Linear model

I will use the lm function in R to estimate the beta coefficients and create my model:

```{r}
price_lm <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms, data = dt_houses)

summary(price_lm)
```

We got an R-squared of 0.64, which is not that bad for a first attempt. I will try to do better later, but first, another model.

```{r}
price_lm_mse <- mean(price_lm$residuals^2)

price_lm_mse
```
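
Because the MSE is in squared price units, it is hard to read directly. As a sketch, the RMSE and MAE can be derived from the same residuals. The built-in `mtcars` data stands in for the housing data here, and the `sketch_*` names are hypothetical.

```{r}
# Built-in mtcars data stands in for the housing data here.
sketch_lm <- lm(mpg ~ wt + hp, data = mtcars)
sketch_res <- residuals(sketch_lm)

sketch_mse  <- mean(sketch_res^2)      # mean squared error
sketch_rmse <- sqrt(sketch_mse)        # same units as the response
sketch_mae  <- mean(abs(sketch_res))   # less sensitive to large errors

c(mse = sketch_mse, rmse = sketch_rmse, mae = sketch_mae)
```

Note that these are errors on the training data, so they probably understate the error on unseen properties.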


## Tree Model

I think this model could perform better, because some variables may affect the price not only linearly but in other ways; in that case a tree model can show better performance.

```{r}
prices_tree <- rpart(data = dt_houses, formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms, method = 'anova')

prp(prices_tree, digits = -3)
```

```{r}
printcp(prices_tree)
```

```{r}
prices_tree
```

```{r}
plotcp(prices_tree)
```


```{r}
prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp)
# Plot the pruned tree, not the original one
prp(model_tree, digits = -3)
```
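
The pruning step can be sketched in isolation as follows. The built-in `mtcars` data stands in for the housing data, the `sketch_*` names are hypothetical, and the one-standard-error rule is an alternative I am adding for illustration; the analysis above uses the minimum-`xerror` choice.

```{r}
library(rpart)

set.seed(1)  # cross-validated errors in the CP table depend on random folds

# Built-in mtcars data stands in for the housing data here.
sketch_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")
cp_tab <- sketch_tree$cptable

# Option 1: the CP value with the lowest cross-validated error.
min_row <- which.min(cp_tab[, "xerror"])

# Option 2 (one-standard-error rule): the simplest tree whose xerror is
# within one xstd of that minimum.
threshold <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
one_se_row <- which(cp_tab[, "xerror"] <= threshold)[1]

sketch_pruned <- prune(sketch_tree, cp = cp_tab[one_se_row, "CP"])

# Evaluate the pruned object, not the original fit.
sketch_pred <- predict(sketch_pruned, newdata = mtcars)
mean((mtcars$mpg - sketch_pred)^2)
```

If the lowest `xerror` belongs to the last row of the CP table, pruning at that value returns the full tree unchanged.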


```{r}
# Predict with the pruned tree so that the tuned model is the one evaluated
prices_tree_pred <- predict(model_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus")])
prices_tree_mse <- mean((dt_houses$price - prices_tree_pred)^2)

prices_tree_mse
```


## Comparing two models
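
One possible way to compare the two models fairly is to fit both on a training split and measure the error on held-out rows. The following is only a sketch under assumptions: the built-in `mtcars` data stands in for the housing data, and the split size, the seed and the helper `rmse()` are hypothetical choices.

```{r}
library(rpart)

set.seed(42)  # arbitrary seed so that the split is reproducible

# Built-in mtcars data stands in for the housing data here.
test_idx <- sample(nrow(mtcars), size = 8)
sketch_train <- mtcars[-test_idx, ]
sketch_test  <- mtcars[test_idx, ]

# Same inputs for both models, as the comparison requires.
sketch_formula <- mpg ~ wt + hp + disp + qsec + drat

fit_lm   <- lm(sketch_formula, data = sketch_train)
fit_tree <- rpart(sketch_formula, data = sketch_train, method = "anova")

# Helper (hypothetical name): root mean squared error on new data.
rmse <- function(model, newdata) {
  sqrt(mean((newdata$mpg - predict(model, newdata))^2))
}

sketch_rmse_holdout <- c(lm = rmse(fit_lm, sketch_test), tree = rmse(fit_tree, sketch_test))
sketch_rmse_holdout
```

Training-set errors tend to flatter flexible models, so a held-out comparison like this is usually the safer basis for choosing between the two.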


# Feature Engineering

Now I would like to improve my linear model. I think that furnishing status should be treated as a factor variable, so I am going to convert it to a factor explicitly:

```{r}
dt_houses$furnishingstatus_factor <- factor(dt_houses$furnishingstatus)
```


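Now rerun the model with the new feature. Because lm() already converts character columns to factors internally, the estimates should match the earlier fit; only the coefficient names change.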

```{r}
price_lm <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus_factor + bathrooms, data = dt_houses)

summary(price_lm)
```
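
The effect of the explicit conversion can be checked with a small sketch. The toy values are made up; the point is only that `lm()` gives the same estimates for a character column and for its factor version.

```{r}
# Hypothetical toy values; column names mirror the housing dataset.
toy_furnish <- data.frame(
  price = c(5.0, 4.2, 3.1, 5.5, 4.0, 2.9, 5.2, 4.4),
  furnishingstatus = rep(c("furnished", "semi-furnished", "unfurnished", "furnished"), times = 2)
)
toy_furnish$furnishingstatus_factor <- factor(toy_furnish$furnishingstatus)

fit_chr <- lm(price ~ furnishingstatus, data = toy_furnish)
fit_fct <- lm(price ~ furnishingstatus_factor, data = toy_furnish)

# The estimates agree; only the coefficient names differ.
same_fit <- all.equal(unname(coef(fit_chr)), unname(coef(fit_fct)))
same_fit
```

So the factor conversion alone should not change the performance; a feature has to carry new information to do that.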


Engineer a minimum of two new features based on your data exploration or on theoretical considerations. Add these features to your models and reevaluate their performance on the same performance metrics as before.
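
As a sketch of what such features could look like, here are two candidates built with `data.table`. The feature names `area_per_bedroom` and `log_area` and the toy values are my own assumptions for illustration, not results from this dataset.

```{r}
library(data.table)

# Hypothetical toy values; column names mirror the housing dataset.
toy_features <- data.table(
  area = c(3000, 4500, 6000, 8000),
  bedrooms = c(2, 3, 3, 4),
  bathrooms = c(1, 1, 2, 2)
)

# Two candidate features (names are illustrative assumptions):
#   area_per_bedroom: how spacious the rooms are on average
#   log_area: compresses very large areas, which may reduce skew
toy_features[, `:=`(
  area_per_bedroom = area / bedrooms,
  log_area = log(area)
)]
toy_features
```

Whether these features help would have to be checked by refitting both models and comparing the same metrics as before.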

***




